brian-dellabetta
left a comment
sorry, i approved this thinking it was the one-liner removing clear-ml, will have to take a closer look
brian-dellabetta
left a comment
I am understanding this for the most part -- very cool!
```python
@contextmanager
def do_not_offload():
    to_offload = set()

    def patched(self, module, output):
        # record the module instead of offloading it immediately
        to_offload.add((self, module))
        return output

    with patch_attr(AlignDevicesHook, "post_forward", patched):
        yield
    # offload on exit
    for hook, module in to_offload:
        hook.post_forward(module, None)


for subgraph in subgraphs:
    with do_not_offload():
        subgraph(**inputs)
```
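A self-contained version of this patch-and-replay pattern can be exercised with a toy hook. Note the `AlignDevicesHook` below is a hypothetical stand-in for accelerate's hook class, and `keep_onloaded` is a sketch, not llm-compressor's actual implementation:

```python
from contextlib import contextmanager
from unittest.mock import patch


class AlignDevicesHook:
    """Hypothetical stand-in for accelerate's AlignDevicesHook."""

    def __init__(self):
        self.offload_calls = 0

    def post_forward(self, module, output):
        # In accelerate, this is where weights are moved back off-device
        self.offload_calls += 1
        return output


@contextmanager
def keep_onloaded(hook_cls):
    """Defer offloading: record skipped post_forward calls, replay them on exit."""
    deferred = []
    original = hook_cls.post_forward

    def patched(self, module, output):
        deferred.append((self, module, output))  # skip offloading for now
        return output

    with patch.object(hook_cls, "post_forward", patched):
        yield
    # offload on exit, once per recorded call
    for hook, module, output in deferred:
        original(hook, module, output)


hook = AlignDevicesHook()
with keep_onloaded(AlignDevicesHook):
    hook.post_forward("module", "output")
    assert hook.offload_calls == 0  # offload deferred while inside the context
assert hook.offload_calls == 1  # offload replayed on exit
```

The key design point is that the patch is scoped to the class, so every hook instance touched during the context defers its offload until the whole subgraph step finishes.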
brian-dellabetta
left a comment
awesome stuff!
## Purpose ##
* Speed up tests by reducing device movement

## Background ##
As of #1263, the model is dispatched to different device maps depending on which pipelines are used. If the model starts on anything but the CPU, these dispatches and undispatches create device movement. Starting on the CPU ensures that no device movement occurs when offloaded dispatches happen.

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
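As a toy illustration of why the starting device matters (the `Param` class and `dispatch` helper below are hypothetical, not llm-compressor code): a dispatch to an offloaded ("cpu") device map only moves weights that do not already live there.

```python
class Param:
    """Hypothetical toy parameter that counts device moves."""

    def __init__(self, device):
        self.device = device
        self.moves = 0

    def to(self, device):
        if device != self.device:
            self.device = device
            self.moves += 1
        return self


def dispatch(params, device_map):
    """Move each parameter to its target device, like a dispatch would."""
    for param, device in zip(params, device_map):
        param.to(device)


# Starting on cpu: an offloaded ("cpu") dispatch is a no-op
cpu_params = [Param("cpu") for _ in range(4)]
dispatch(cpu_params, ["cpu"] * 4)
assert sum(p.moves for p in cpu_params) == 0

# Starting on cuda: the same dispatch moves every parameter
cuda_params = [Param("cuda:0") for _ in range(4)]
dispatch(cuda_params, ["cpu"] * 4)
assert sum(p.moves for p in cuda_params) == 4
```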
…ssed (vllm-project#1530)

## Purpose ##
* Fix failing examples

## Changes ##
* Save model after generation in all examples
  * Previously, models would be saved before generation, causing generation to fail because we do not yet fully support generating with compressed models

## Future ##
* In the future, we can define a better API around compressing and decompressing models which does not require so many arguments
* In the future, we can standardize around reloading (and redispatching) the model before generation, as suggested in vllm-project#1263
* In the future, we can remove the sample generation step

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
# Sequential Onloading #

<p align="center"><img width="403" alt="Screenshot 2025-06-05 at 22 53 01" src="https://github.com/user-attachments/assets/ffd610ac-c511-4dc1-b858-b0ed2bf95193" /></p>

```
(25/33): Calibrating:   0%|          | 0/512 [00:00<?, ?it/s]
<class 'transformers.models.llama.modeling_llama.LlamaRMSNorm'>.weight -> cuda
<class 'torch.nn.modules.linear.Linear'>.weight -> cuda
<class 'torch.nn.modules.linear.Linear'>.weight_scale -> cuda
<class 'torch.nn.modules.linear.Linear'>.weight_zero_point -> cuda
...
(25/33): Calibrating: 100%|█████| 512/512 [00:23<00:00, 21.91it/s]
2025-06-03T17:29:15.536963-0400 | compress_modules | INFO - Quantizing model.layers.24.self_attn.q_proj using 512 samples
2025-06-03T17:29:17.328720-0400 | compress | METRIC - time 1.79s
2025-06-03T17:29:17.329265-0400 | compress | METRIC - error 8948.54
2025-06-03T17:29:17.329781-0400 | compress | METRIC - GPU 0 | usage: 5.41% | total memory: 85 GB
2025-06-03T17:29:17.330248-0400 | compress | METRIC - Compressed module size: 33.947648 MB
...
(25/33): Propagating: 100%|█████| 512/512 [00:03<00:00, 131.16it/s]
<class 'transformers.models.llama.modeling_llama.LlamaRMSNorm'>.weight -> meta
<class 'torch.nn.modules.linear.Linear'>.weight -> meta
<class 'torch.nn.modules.linear.Linear'>.weight_scale -> meta
<class 'torch.nn.modules.linear.Linear'>.weight_zero_point -> meta
...
```

## Purpose ##
* Reduce hardware requirements for calibrating large models
* Reduce runtime caused by excess device movement when calibrating offloaded models

## Prerequisites ##
* vllm-project/compressed-tensors#354
* vllm-project/compressed-tensors#355
* vllm-project/compressed-tensors#356
* vllm-project/compressed-tensors#357

## Related Issues ##
* Resolves vllm-project#1383
* Resolves vllm-project#1228
* Resolves vllm-project#1122
* Resolves vllm-project#1078
* Resolves vllm-project#1216
* Resolves vllm-project#1483

## Changes ##
* Keep layer parameters onloaded during the entire sequential calibration + compression + propagation step
  * This is achieved through the `keep_onload_context`, which disables offloading until the context is exited
* Dispatch the model within each calibration pipeline
  * The sequential pipeline offloads the model to CPU and executes on the first cuda device
* Deprecate passing `sequential_targets` via modifiers; instead, prefer passing via oneshot argument
* Use the sequential pipeline as the default pipeline (the basic pipeline is never used)
* Dispatch the model before sample generation
  * The model is dispatched exactly as it would be if it were loaded with `device_map="auto"`

### Examples ###
* Models are loaded onto CPU before oneshot (rather than being dispatched across GPUs)
* The model is reloaded from disk in order to redispatch it onto the "auto" device map
  * In my opinion, this is a better flow anyway, since models can raise errors / take a very long time during generation, which can cause the entire compression job to go to waste
  * The alternative is to either call `accelerate.remove_hooks(model)` and `accelerate.dispatch_model(model)` before generating, or get rid of sample generation entirely. One of these may be required if `compressed_linear` isn't reliable enough to add to our examples

<details><summary>New example script</summary>

```python3
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.utils.dev import dispatch_for_generation

# Load model (on cpu)
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")  # model is loaded on cpu
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Define recipe
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Apply oneshot (model execution device is set to cuda, model stays on cpu)
oneshot(
    model=model,
    dataset="ultrachat_200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Perform sample generation
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk
SAVE_DIR = model_id.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

</details>

## Testing ##
* Calibrated and GPTQ-compressed one layer of Deepseek-V3 with a single H100 in 50 seconds
  * 4.5x improvement over the original 236 seconds
  * Peak memory of ~40 GB, which can be further reduced by increasing the granularity of sequential targets
* Not offloading activations did not result in a performance improvement
* TODO: Test that all example models can be reloaded and run

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: Brian Dellabetta <bdellabe@redhat.com>
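The onload-once-per-subgraph flow described under Changes can be sketched with mock layers. The `Layer` class and `onloaded` context manager below are hypothetical stand-ins; the real pipeline manages devices through accelerate hooks and `keep_onload_context` rather than a device string.

```python
from contextlib import contextmanager


class Layer:
    """Hypothetical stand-in for one sequential subgraph's modules."""

    def __init__(self, name):
        self.name = name
        self.device = "cpu"  # model starts fully offloaded to cpu


@contextmanager
def onloaded(layer, device="cuda:0"):
    """Onload once for the whole calibration + compression + propagation step."""
    layer.device = device
    try:
        yield layer
    finally:
        layer.device = "cpu"  # offload only when the entire step is done


def run_sequential(layers, num_batches=2):
    events = []
    for layer in layers:
        with onloaded(layer):
            for _ in range(num_batches):
                # calibrate: every batch sees the layer already on-device
                events.append((layer.name, layer.device))
            # compress + propagate would happen here, still onloaded
    return events


events = run_sequential([Layer("l0"), Layer("l1")])
assert all(device == "cuda:0" for _, device in events)
```

The point of the sketch: only one layer is resident on the execution device at a time, so peak memory scales with the largest subgraph rather than the whole model, while each layer is moved exactly once per step instead of once per batch.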
